Introduction


Plots can provide a useful visual summary of the data. Sometimes, a nice plot or two is all that is need for statistical analysis. In this document, we cover a basic overview of creating some plots in R.

Here’s a link to a more thorough coverage of plotting in R: https://r-graph-gallery.com/index.html.

Help Documentation


The plotting functions introduced in this document have robust help documentation with lots of options to customize your plots. If you want to view help documentation for any of the functions used in this document, run commands such?hist, ?plot, ?table, and so on.

Loading Packages and Data in R


To demonstrate how to create common statistical plots in R, we will use the storms dataset which is located in the package dplyr.

# load the library of function and data in dplyr
library(dplyr)

The package dplyr contains a dataset called storms. Let’s check out the documentation for the data.

?storms  # must load dplyr
# See a summary of all variables
summary(storms)
##      name                year          month             day       
##  Length:11859       Min.   :1975   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:1992   1st Qu.: 8.000   1st Qu.: 8.00  
##  Mode  :character   Median :2002   Median : 9.000   Median :16.00  
##                     Mean   :2001   Mean   : 8.785   Mean   :15.83  
##                     3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:24.00  
##                     Max.   :2020   Max.   :12.000   Max.   :31.00  
##                                                                    
##       hour             lat             long            status         
##  Min.   : 0.000   Min.   : 7.20   Min.   :-109.30   Length:11859      
##  1st Qu.: 6.000   1st Qu.:17.50   1st Qu.: -80.70   Class :character  
##  Median :12.000   Median :24.60   Median : -64.40   Mode  :character  
##  Mean   : 9.117   Mean   :24.76   Mean   : -64.09                     
##  3rd Qu.:18.000   3rd Qu.:31.30   3rd Qu.: -48.40                     
##  Max.   :23.000   Max.   :51.90   Max.   :  -6.00                     
##                                                                       
##  category       wind           pressure    tropicalstorm_force_diameter
##  -1:2898   Min.   : 10.00   Min.   : 882   Min.   :  0.0               
##  0 :5347   1st Qu.: 35.00   1st Qu.: 985   1st Qu.: 60.0               
##  1 :1934   Median : 45.00   Median : 999   Median :120.0               
##  2 : 749   Mean   : 53.64   Mean   : 992   Mean   :145.3               
##  3 : 434   3rd Qu.: 65.00   3rd Qu.:1006   3rd Qu.:210.0               
##  4 : 411   Max.   :160.00   Max.   :1022   Max.   :870.0               
##  5 :  86                                   NA's   :6509                
##  hurricane_force_diameter
##  Min.   :  0.00          
##  1st Qu.:  0.00          
##  Median :  0.00          
##  Mean   : 18.15          
##  3rd Qu.: 25.00          
##  Max.   :300.00          
##  NA's   :6509

One Quantitative Variable


Histograms


The hist function can be used create a histogram of a numerical vector.

hist(storms$pressure, 
     xlab = "storm pressure (in millibars)",
     main = "Distribution of Storm Pressure 1975-2020",
     breaks = 10, 
     col = "aquamarine4")

Density plots


Sometimes we may prefer to use a density plot over the histogram because the histogram is more sensitive to its options.

plot(density(storms$pressure),
     xlab = "storm pressure (in millibars)",
     main = "Distribution of Storm Pressure 1975-2020")

Boxplots


Boxplots are another useful plot for presenting the distribution of a quantitative variable using the five number summary.

boxplot(storms$pressure,
        ylab = "storm pressure (in millibars)",
        main = "Distribution of Storm Pressure 1975-2020")

Changing the Layout of Boxplots


boxplot(storms$pressure,
        xlab = "storm pressure (in millibars)",
        main = "Distribution of Storm Pressure 1975-2020",
        horizontal = TRUE)

One Qualitative Variable


Qualitative variables may be entered as characters (such as status of storm) or values (such as category).

typeof(storms$category)
## [1] "integer"
typeof(storms$status)
## [1] "character"

Caution on Using plot()


If we try to use the general plot() function, R will give its best guess at which plot makes the most sense to display the data.

  • If the data is stored as the wrong data type, plot() will not work as we might expect.
  • You will either need to use table() or convert the data type to factor first if you prefer to use plot().
plot(storms$status)

Creating Barplots with table(), prop.table() and barplot()


The command table(x) will count the number of times a value (or string of characters) occurs in x.

status.table <- table(storms$status)
status.table
## 
##           hurricane tropical depression      tropical storm 
##                3613                2898                5348

After creating a table, we can present the results visually in a bar chart.

barplot(status.table, main = "Distribution of Storm Status",
        ylab = "Frequency",
        col = "steelblue")

Pie Charts with pie()


Pie Charts can also be used to illustrate the distribution of one qualitative variable.

pie(status.table, #labels = status.table,
    main = "Distribution of Storm Status")

The command prop.table([table_name]) will count the proportion of all observations that a value (or string of characters) occurs in the table named [table_name].

is.table(storms$status)
## [1] FALSE
is.table(status.table)
## [1] TRUE
  • The command prop.table(storms$status) will return an error since storms$status is not a table.
status.prop <- prop.table(status.table)  
status.prop
## 
##           hurricane tropical depression      tropical storm 
##           0.3046631           0.2443714           0.4509655
barplot(status.prop, main = "Relative Frequency of Storm Status",
        ylab = "Proportion",
        col = "steelblue")

Converting to Categorical Data with factor() and Using plot()


  • One common issue with a categorical variable is that it is often stored as a data type such as characters or integers.
  • We would like observations with the same values to be group together.
  • Categorical data should be stored as a factor in R.
storms.fac <- storms  # creates a second copy so we don't overwrite storms

storms.fac$status <- factor(storms$status)  # convert status to factor
summary(storms.fac$status)  # get new summary of factors
##           hurricane tropical depression      tropical storm 
##                3613                2898                5348
  • Once the variable status is converted to a factor, we can plot the data using plot(). R assumes factors are best displayed with a bar chart.
plot(storms.fac$status,
     main = "Distribution of Storms Status 1975-2020",
     xlab = "Storm Status",
     ylab = "Frequency", 
     col = "steelblue")

Relationship Between One Quantitative and One Qualitative Variable


Subsetting Data by Category


Imagine we would like to compare the wind speeds of tropical storms, hurricanes, and tropical depressions.

  • Recall the data frame storms has variable status stored as characters (not factors).
  • We can split the data frame into three separate data frames, one for each status of storm.
  • Then make three separate boxplots of the wind speeds for each status.
# split data by storm status
hur <- filter(storms, status == "hurricane")
trop.storm <- filter(storms, status == "tropical storm")
trop.dep <- filter(storms, status == "tropical depression")

# Create side by side boxplot
boxplot(hur$wind, trop.storm$wind, trop.dep$wind, 
        main = "Windspeed of Storms (knots)", 
        names = c("Hurricanes", "Tropical Storms", "Tropical Depressions"), 
        col = c("red", "blue", "green"), 
        horizontal = TRUE)

Converting to Factors and Using plot()


  • One common issue with a categorical variable is that it is often stored as a data type such as characters or integers.
  • We would like observations with the same values to be group together.
  • Categorical data should be stored as a factor in R.
storms.fac <- storms  # creates a second copy so we don't overwrite storms
storms.fac$status <- factor(storms$status)  # convert status to factor
  • Once the variable status is converted to a factor, R will assume we want to plot each class of the factor on a separate plot. Since wind is numeric, plot() will create a boxplot for each category.
  • One advantage of converting to factors first is we do not need to subset the data!
plot(wind ~ status, data = storms.fac, 
     col = c("forestgreen", "blueviolet", "azure4"))

Relationship Between Two Qualitative Variables


Creating Grouped Frequency Barplots with table(), prop.table() and barplot()


Grouped Frequency Barplots


The command table(x) will count the number of times a value (or string of characters) occurs in x.

con.table <- table(storms$status, storms$month)
con.table
##                      
##                          1    4    5    6    7    8    9   10   11   12
##   hurricane              5    0    0   18  100  776 1793  701  187   33
##   tropical depression    2    0   40  169  349  717 1085  379  136   21
##   tropical storm        23   13   50  187  474 1214 2016  913  387   71

After creating a table, we can present the results visually in a group bar chart.

barplot(con.table, beside = TRUE,  # groups side-by-side
        legend=rownames(con.table),
        main = "Storm Types Each Month: 1975-2020",
        xlab = "Month",
        ylab = "Frequency")

Grouped Frequency Barplots


  • Note beside = FALSE is the default.
  • If we do not specify a beside option, a stacked barplot is created.
barplot(con.table,  # groups side-by-side
        legend=rownames(con.table),
        main = "Storm Types Each Month: 1975-2020",
        xlab = "Month",
        ylab = "Frequency")

Creating Stacked Relative Frequency Barplots with table(), prop.table() and barplot()


Creating Joint Distribution Tables

  • First we create a contingency table using table(x, y).
  • Then we use prop.table([table_name]) to convert to frequencies to proportions out of the grand total.
con.prop <- prop.table(con.table)  # contingency table percentage of all observations
con.prop
##                      
##                                  1            4            5            6
##   hurricane           0.0004216207 0.0000000000 0.0000000000 0.0015178346
##   tropical depression 0.0001686483 0.0000000000 0.0033729657 0.0142507800
##   tropical storm      0.0019394553 0.0010962138 0.0042162071 0.0157686146
##                      
##                                  7            8            9           10
##   hurricane           0.0084324142 0.0654355342 0.1511931866 0.0591112235
##   tropical depression 0.0294291256 0.0604604098 0.0914916941 0.0319588498
##   tropical storm      0.0399696433 0.1023695084 0.1699974703 0.0769879416
##                      
##                                 11           12
##   hurricane           0.0157686146 0.0027826967
##   tropical depression 0.0114680833 0.0017708070
##   tropical storm      0.0326334430 0.0059870141

Often, we would like the proportions in the table to be computed out of the total in each row (instead of the grand total).

  • We add the option 1 inside prop.table().
con.prop.row <- prop.table(con.table,1)  # contingency table percentage row total
con.prop.row
##                      
##                                  1            4            5            6
##   hurricane           0.0013838915 0.0000000000 0.0000000000 0.0049820094
##   tropical depression 0.0006901311 0.0000000000 0.0138026225 0.0583160801
##   tropical storm      0.0043006731 0.0024308153 0.0093492895 0.0349663426
##                      
##                                  7            8            9           10
##   hurricane           0.0276778301 0.2147799613 0.4962634929 0.1940215887
##   tropical depression 0.1204278813 0.2474120083 0.3743961353 0.1307798482
##   tropical storm      0.0886312640 0.2270007479 0.3769633508 0.1707180254
##                      
##                                 11           12
##   hurricane           0.0517575422 0.0091336839
##   tropical depression 0.0469289165 0.0072463768
##   tropical storm      0.0723635004 0.0132759910

Other times, we would like the proportions in the table to be computed out of the total in each column.

  • We add the option 2 inside prop.table().
con.prop.col <- prop.table(con.table,2)  # contingency table percentage column total
con.prop.col
##                      
##                                1          4          5          6          7
##   hurricane           0.16666667 0.00000000 0.00000000 0.04812834 0.10834236
##   tropical depression 0.06666667 0.00000000 0.44444444 0.45187166 0.37811484
##   tropical storm      0.76666667 1.00000000 0.55555556 0.50000000 0.51354280
##                      
##                                8          9         10         11         12
##   hurricane           0.28666420 0.36636698 0.35173106 0.26338028 0.26400000
##   tropical depression 0.26486886 0.22170004 0.19016558 0.19154930 0.16800000
##   tropical storm      0.44846694 0.41193298 0.45810336 0.54507042 0.56800000

Creating a Barplot from a Contingency Table


Now we can create a stacked relative frequency barplot.

barplot(con.prop.col,
        legend=rownames(con.prop.col),
        main = "Storm Types Each Month: 1975-2020",
        xlab = "Month",
        ylab = "Proportion")

Converting to Factors and Using plot()


  • One common issue with a categorical variable is that it is often stored as a data type such as characters or integers.
  • We would like observations with the same values to be group together.
  • Categorical data should be stored as a factor in R.
storms.fac <- storms  # creates a second copy so we don't overwrite storms
storms.fac$status <- factor(storms$status)  # convert status to factor
storms.fac$month <- factor(storms$month)  # convert month to factor
  • Once the variables status and month are converted to factors, the plot() function in R will recognize the data types and generate a stacked barplot by default.
  • One advantage of converting to factors first is we do not need to create tables!
  • The graph below is not so clear, but you can adjust using options.
plot(status ~ month, data = storms.fac, 
     col = c("forestgreen", "blueviolet", "azure4"))

Relationship Between Two Quantitative Variables


Bivariate scatter plots can be used to identify the relationship between two quantitative variables.

plot(wind ~ pressure, data = storms,
     main = "Relation of Pressure and Wind Speed of Storms",
     xlab = "Pressure (in millibars)",
     ylab = "Wind Speed (in knots)")

Arranging Multiple Plots in an Array


par(mfrow = c(2, 2))  # Create a 2 x 2 array of plots

# The next 5 plots created will be arranged in the array
boxplot(storms$wind)  # create boxplot of wind speed

# Code below creates a histogram of wind speed
# We can add many options to customize
hist(storms$wind, xlab = "wind speed (in knots)",   # x-axis label
     ylab = "Frequency",  # y-axis label
     main = "Distribution of Storm Wind Speed 1975-2020",  # main label
     col = "steelblue")  # change color of bars

plot(storms.fac$status, col = "gold")  # plots status, which is categorical

plot(wind ~ pressure, data = storms)  # plots two numerical variables

par(mfrow = c(1, 1))   # change settings so one image displayed in a window

# Compare numerical wind speed for different categories of storms
plot(wind ~ status, data = storms.fac, col = "springgreen4")

Optional: More Advanced Plots with ggplot2


The previous plots were created using R’s base graphics system.

A fancier alternative is to construct plots using the ggplot2 package.

In its simplest form, to construct a (useful) plot in ggplot2, you need to provide:

Loading ggplot2


library(ggplot2)  # make sure you have installed ggplot2 package

Plotting One Numerical Variable with ggplot2


ggplot(storms, aes(x = wind)) + 
  geom_histogram(fill = "steelblue", color="black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(storms, aes(x = wind)) + 
  geom_density(color="red") + 
  theme_bw() # adding theme_bw()  makes white background

# adding theme_bw()  makes white background
ggplot(storms, aes(x = wind)) + 
  geom_boxplot(color="black", fill="blueviolet")

Scatter Plots with ggplot2


ggplot(storms) + 
  geom_point(aes(x = pressure, y = wind))

Scaling ggplot2 plots


In general, scaling is the process by which ggplot2 maps variables to unique values. When this is done for discrete or categorical variables, ggplot2 will often scale the variable to distinct colors, symbols, or sizes, depending on the aesthetic mapped.

In the example below, we map the status variable to the color aesthetic, which is then scaled to different colors for the different status levels.

ggplot(storms) + 
  geom_point(aes(x = pressure, y = wind, color = status))

Alternatively, we can map the status variable to the shape aesthetic, which creates a plot with different shapes for each observation based on the status level.

ggplot(storms) + 
  geom_point(aes(x = pressure, y = wind, shape = status))

We can even combine these two aesthetic mappings in a single plot to get different colors and symbols for each level of month and status, respectively.

ggplot(storms) + 
  geom_point(aes(x = pressure, y = wind, color = month, shape = status))

Facetting in ggplot2


Faceting creates separate panels (facets) of a data frame based on one or more faceting variables.

Below, we facet the data by the category.

ggplot(storms) + 
  geom_point(aes(x = pressure, y = wind)) + 
  facet_grid(~ category)

Grouped Barplots with ggplot2


# stacks bars on top of each other 
ggplot(storms, aes(x=month)) + 
    geom_bar(aes(fill=status), stat = "count", position="stack") + 
    ggtitle("Occurrence of Storms by Month: 1975-2020")

# side by side groupings
ggplot(storms, aes(x=month)) + 
    geom_bar(aes(fill=status), stat = "count", position="dodge") +  
    ggtitle("Occurrence of Storms by Month: 1975-2020")

# stacks bars and standardizing each stack
ggplot(storms, aes(x=month)) + 
    geom_bar(aes(fill=status), stat = "count", position="fill") +  
    ggtitle("Occurrence of Storms by Month: 1975-2020")

Optional: Spatial Plots with mapview


Load Library


library(mapview)  # load spatial mapping package

Mapping All Storms by Status


mapview(storms, xcol = "long", ycol = "lat", 
        zcol = "status", 
        crs = 4269, grid = FALSE)

Mapping Category 5 Hurricanes


First we filter out observations with category equal to 5.

cat5 <- subset(storms , category == "5")  # keep only category 5
mapview(cat5, xcol = "long", ycol = "lat", cex = "wind", crs = 4269, grid = FALSE)
mapview(cat5, xcol = "long", ycol = "lat", zcol = "name", cex = "wind", crs = 4269, grid = FALSE)